Skip to content

refactor: separate statistic computation#411

Open
tristan-f-r wants to merge 23 commits into
mainfrom
lazy-stats
Open

refactor: separate statistic computation#411
tristan-f-r wants to merge 23 commits into
mainfrom
lazy-stats

Conversation

@tristan-f-r

@tristan-f-r tristan-f-r commented Oct 10, 2025

Copy link
Copy Markdown
Collaborator

We also make graph statistics lazy. Laziness isn't used in summary.py, but I assume that we'll have more computationally expensive graph statistics as SPRAS develops, especially when it can take long to compute for our larger graphs, so this also splits up statistic generation into different rules.

Most importantly, this allows us to re-use statistics by consuming specific statistics as input files, which is currently used in #431.

  • Depends on feat!: SPRAS revision #320 for the summary statistics test (integration testing over unit testing is now required since the heavy workflow lifting is done by Snakemake).

@tristan-f-r tristan-f-r added tuning Workflow-spanning algorithm tuning refactor Changes that don't actually improve anything except for code quality. labels Oct 10, 2025
@read-the-docs-community

read-the-docs-community Bot commented Oct 10, 2025

Copy link
Copy Markdown

Documentation build overview

📚 spras | 🛠️ Build #32474691 | 📁 Comparing 9053a59 against latest (caf8e9e)

  🔍 Preview build  

4 files changed
± genindex.html
± contributing/maintain.html
± fordevs/spras.analysis.html
± fordevs/spras.html

@tristan-f-r tristan-f-r requested a review from ntalluri October 14, 2025 17:38
@tristan-f-r tristan-f-r added the P-medium medium prirotity; this is needed for some external service or another PR label Oct 14, 2025
@tristan-f-r

tristan-f-r commented Oct 14, 2025

Copy link
Copy Markdown
Collaborator Author

Building on top of this PR allows me to add graph heuristics.

Most likely, every tuning PR will be at least marked with P-medium unless it's an end result.

@github-actions github-actions Bot added the merge-conflict This PR has merge conflicts. label Oct 30, 2025
@tristan-f-r tristan-f-r mentioned this pull request Oct 30, 2025
1 task
@github-actions github-actions Bot removed the merge-conflict This PR has merge conflicts. label Oct 30, 2025
@agitter

agitter commented Nov 7, 2025

Copy link
Copy Markdown
Collaborator

Before I can review the implementation of the change, I need to better understand what problem we are tying to solve with the change. Where will laziness be needed in the future?

we can reuse the code for graph heuristic pruning

Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow?

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

I was going to ask @ntalluri about this, since I wasn't quite sure if we will have expensive graph heuristics or not.

Do we envision calling graph statistic computation twice per graph? After we compute these statistics on a graph once, shouldn't that be sufficient for an entire pass of a workflow?

I did decouple this from analysis: summary: enabled: true, and I imagined it like this. I didn't think about that, though: would it make sense to have graph summary statistics always enabled the moment any heuristics are enabled?

@agitter

agitter commented Nov 8, 2025

Copy link
Copy Markdown
Collaborator

There could be more than one way to design this sensibly. One would be that if heuristics are enabled in the config file, that automatically generates the graph summary table. The produces more output than requested, which is slightly undesirable.

Another could be to move the heuristic calculations inside each --parameters> subdirectory, which may be where you are headed. If that is written as a file for that one pathway, it could be consumed for heuristics (or used for heuristics and then written to disk). Later, if the graph summary table is requested, it would grab the precomputed statistics from those files in the subdirectories.

@tristan-f-r

Copy link
Copy Markdown
Collaborator Author

I'll mark this as a draft for now and design something in line with your second proposal.

@tristan-f-r tristan-f-r marked this pull request as draft November 8, 2025 08:06
@ntalluri ntalluri removed the P-medium medium prirotity; this is needed for some external service or another PR label Nov 19, 2025
@tristan-f-r tristan-f-r added the P-high This is a blocker for many PRs/issues/features label Jan 13, 2026
@tristan-f-r tristan-f-r marked this pull request as ready for review January 13, 2026 20:13
@tristan-f-r tristan-f-r removed the P-high This is a blocker for many PRs/issues/features label Jan 13, 2026
@tristan-f-r tristan-f-r added blocked-by-other-pr P-medium medium prirotity; this is needed for some external service or another PR labels Jan 13, 2026
@ntalluri

ntalluri commented Feb 5, 2026

Copy link
Copy Markdown
Collaborator

Would you be able to explain what the goal and what the changes are of this PR in the top comment? Also why does this depend on SPRAS revision?

@tristan-f-r

tristan-f-r commented Feb 5, 2026

Copy link
Copy Markdown
Collaborator Author

I've edited the top comment to mention the heuristics PR 👍, though the motivation was already present.

As mentioned in the meeting and in the top-level comment, this depends on the integration testing part of the SPRAS revision and not the immutability section.

@agitter agitter left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new design. I'm not confident I follow everything in spras/statistics.py correctly or how the testing updates intersect with #320.

Comment thread spras/analysis/summary.py Outdated
Comment thread spras/config/config.py Outdated
Comment thread spras/statistics.py Outdated
Comment thread spras/statistics.py
Comment thread spras/statistics.py Outdated
Comment thread spras/analysis/summary.py
Comment thread test/analysis/test_cytoscape.py
Comment thread Snakefile Outdated
Comment thread Snakefile
@github-actions github-actions Bot added the merge-conflict This PR has merge conflicts. label Mar 16, 2026
@github-actions github-actions Bot removed the merge-conflict This PR has merge conflicts. label Mar 19, 2026
@tristan-f-r tristan-f-r mentioned this pull request Mar 26, 2026
@ntalluri

Copy link
Copy Markdown
Collaborator

Is there a reason why we need to have separate statistic folders within each output subnetwork folder? I don't understand this motivation.

@ntalluri

Copy link
Copy Markdown
Collaborator

I read the conversion between you and tony, I see the motivation now.

@ntalluri ntalluri left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am running this locally still, but this is my current review

Comment thread spras/statistics.py
return max(degrees), median(degrees)

def compute_on_cc(directed_graph: nx.DiGraph) -> tuple[int, float]:
# We convert our directed_graph to an undirected graph as networkx (reasonably) does

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember why we do this. @agitter I remember we talked about this years ago.

@tristan-f-r tristan-f-r Apr 29, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[I'm a little confused - is it why we compute the number of connected components in the first place?]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's about the undirected graph conversion, that comment should be why.

@agitter agitter Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ntalluri was your comment asking why we convert to undirected for the connected component calculation?

Related to my comment above, I recommend we stay with reading undirected graphs and use them throughout. That affects other statistics like degree as well.

Comment thread spras/analysis/summary.py
Comment thread spras/statistics.py Outdated
Comment thread spras/analysis/summary.py
Comment thread Snakefile
Comment thread spras/statistics.py
Comment thread Snakefile
Comment thread Snakefile
Comment thread spras/statistics.py Outdated
@tristan-f-r tristan-f-r requested a review from ntalluri April 30, 2026 03:06

@agitter agitter left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finally understand the design better and like it.

The tests don't pass for me locally

FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[example] - AssertionError: assert False
FAILED test/analysis/test_summary.py::TestSummary::test_example_networks[egfr] - AssertionError: assert False

Comment thread spras/analysis/summary.py

def summarize_networks(file_paths: Iterable[Path], node_table: pd.DataFrame, algo_params: dict[str, dict],
algo_with_params: list[str]) -> pd.DataFrame:
algo_with_params: list[str], statistics_files: Mapping[str, Iterable[str | os.PathLike]]) -> pd.DataFrame:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have LoosePathLike in util. Does it make sense to use it here?

Comment thread spras/statistics.py
Comment on lines +8 to +10
To make the statistics allow directed graph input, they will always take
in a networkx.DiGraph, which contains even more information, even though
the underlying graph may be just as easily represented by networkx.Graph.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a change in functionality? In summary.py we had the opposite

Network directionality is ignored and all edges are treated as undirected

When we previously assessed treating all graphs as directed or undirected, we decided that undirected would be less wrong.

Comment thread spras/statistics.py
statistics_computation: dict[tuple[str, ...], Callable[[nx.DiGraph], tuple[float | int, ...]]] = {
('Number of nodes',): lambda graph : (graph.number_of_nodes(),),
('Number of edges',): lambda graph : (graph.number_of_edges(),),
('Number of connected components',): lambda graph : (nx.number_connected_components(graph.to_undirected()),),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, having to convert everything to undirected is messy.

Comment thread spras/statistics.py
return (avg_path_len,)

# The type signature here is meant to be 'an n-tuple has n outputs.'
statistics_computation: dict[tuple[str, ...], Callable[[nx.DiGraph], tuple[float | int, ...]]] = {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This syntax is somewhat confusing. It took me a minute to realize why everything needs to be a tuple, that sometimes we have a function return one statistic and sometimes multiple. The design makes sense now that I get it, but a comment would have helped.

Comment thread spras/statistics.py
# All of the keys inside statistics_computation, flattened.
statistics_options: list[str] = list(itertools.chain(*(list(key) for key in statistics_computation.keys())))

def from_output_pathway(lines) -> nx.Graph:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I saw it in summary.py, I didn't recognize this function as a graph loader from the name.

subprocess.run(["snakemake", "--cores", "1", "--configfile", f"test/analysis/input/{param}.yaml"])
yield param # this runs the test itself: once this is passed, we go to test cleanup.
shutil.rmtree(f"test/analysis/input/run/{param}")
# shutil.rmtree(f"test/analysis/input/run/{param}")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not need this anymore?

Comment thread Snakefile

# We generate new Snakemake rules for every statistic
# to allow parallel and lazy computation of individual statistics
for keys in statistics_computation.keys():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the complicated typing, I find it hard to track what keys is here. I'm pretty sure it is the tuples describing the statistics as strings.

Comment thread Snakefile
# (See https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#procedural-rule-definition)
name: pythonic_name
input: pathway_file = rules.parse_output.output.standardized_file
output: [SEP.join([out_dir, '{dataset}-{algorithm}-{params}', 'statistics', f'{key}.txt']) for key in keys]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to sanitize the keys here as done in the pythonic_name above? When I run locally, I see output files like Number of connected components.txt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P-medium medium prirotity; this is needed for some external service or another PR refactor Changes that don't actually improve anything except for code quality. tuning Workflow-spanning algorithm tuning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants